Abundances of metabolites in a data matrix usually have a right-skewed distribution. Therefore, an appropriate transformation is needed to obtain a more symmetric distribution. The metabolomics literature has discussed various transformations, such as the log, cube root, and square root, as ways of handling this skewness, most of which belong to the family of power transformations. However, the log transformation is usually adequate for statistical purposes (De Livera, Olshansky, and Speed 2013).
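As a rough illustration of why the log transformation helps, the sketch below compares the skewness of a small set of made-up abundances before and after taking logs. The function names and the data are hypothetical and are not part of the package; the sketch also assumes all abundances are strictly positive.

```python
import math
import statistics


def log_transform(matrix):
    """Natural-log transform a matrix (rows = samples, columns = metabolites).

    Assumes every abundance is strictly positive; math.log raises otherwise.
    """
    return [[math.log(x) for x in row] for row in matrix]


def skewness(values):
    """Third standardized moment (population form) of a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical abundances of one metabolite across six samples.
raw = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
logged = [row[0] for row in log_transform([[v] for v in raw])]

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

For these made-up values the raw abundances are clearly right skewed, while the logged values are evenly spaced and therefore symmetric.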
It is important to reduce the number of missing values as much as possible by using an effective pre-processing procedure. For example, a secondary peak picking method can be used for LC-MS data to fill in missing peaks that were not detected and aligned. Depending on the nature of the missing data, either the k-nearest neighbour algorithm (Troyanskaya et al. 2001) or replacement of the missing values by half the minimum of the data matrix is often used in metabolomics.
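The half-minimum replacement described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name is hypothetical, and missing values are assumed to be encoded as `None`.

```python
def impute_half_minimum(matrix):
    """Replace missing entries (None) with half the smallest observed value.

    The minimum is taken over the whole data matrix, as described in the text.
    """
    observed = [x for row in matrix for x in row if x is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = min(observed) / 2.0
    return [[fill if x is None else x for x in row] for row in matrix]


# Hypothetical 2 x 2 abundance matrix with one missing value.
data = [[4.0, None],
        [2.0, 10.0]]
print(impute_half_minimum(data))
```

The k-nearest neighbour alternative is not shown, since it depends on a choice of distance and of k that the text does not specify.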
The log-transformed data matrix can then be explored using various plots.
RLA (Relative Log Abundance) plots are a good way of visualising the data. Boxes of consistent size that are centered on zero are desirable. See De Livera et al. (2012) and De Livera et al. (2015) for several examples.
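To make the quantity behind an RLA plot concrete, the sketch below computes relative log abundances by subtracting each metabolite's median across samples from its log abundances, which is one common (across-group) definition. The function names and data are hypothetical; the boxplot itself is omitted, and only the per-sample medians that the boxes would be centered on are printed.

```python
import statistics


def relative_log_abundance(log_matrix):
    """Subtract each metabolite's median (over samples) from its column.

    log_matrix: rows = samples, columns = metabolites, already log transformed.
    """
    n_metabolites = len(log_matrix[0])
    medians = [
        statistics.median([row[j] for row in log_matrix])
        for j in range(n_metabolites)
    ]
    return [[row[j] - medians[j] for j in range(n_metabolites)]
            for row in log_matrix]


def sample_medians(rla_matrix):
    """Median RLA value of each sample, i.e. the centre of each box."""
    return [statistics.median(row) for row in rla_matrix]


# Hypothetical log abundances: 3 samples x 3 metabolites.
log_data = [[1.0, 5.0, 3.0],
            [2.0, 6.0, 4.0],
            [3.0, 7.0, 8.0]]
rla = relative_log_abundance(log_data)
print(sample_medians(rla))
```

In this made-up example the first and third samples sit below and above zero respectively, which is the kind of pattern an RLA plot is meant to expose.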
A similar plot can be used to explore metabolites.
PCA plots can be used to identify any outlying samples and to get a preliminary understanding of the structure of the data.
A dendrogram can be used to visualize clusters in the data.
A heat map can also be used, in which samples and metabolites are sorted according to their respective dendrograms.
Normalization methods presented in this package are divided into four categories: those which use (i) internal standards, external standards, and other quality control metabolites (Sysi-Aho et al. 2007; Redestig et al. 2009; De Livera et al. 2012; De Livera et al. 2015; Gullberg et al. 2004), (ii) quality control samples (Dunn et al. 2011), (iii) scaling methods (Scholz et al. 2004; Wang et al. 2003), and (iv) combined methods (Kirwan and Broadhurst 2013). A brief summary of these methods is presented in Table 1 of De Livera et al. (2015).
These approaches use internal standards, external standards, and other quality control metabolites. They include the is method, which uses a single standard (Gullberg et al. 2004), the ccmn (cross contribution compensating multiple internal standard) method (Redestig et al. 2009), the nomis (normalization using optimal selection of multiple internal standards) method (Sysi-Aho et al. 2007), and the remove unwanted variation methods (Gagnon-Bartsch, Jacob, and Speed 2014) as applied to metabolomics using “ruv2” (De Livera et al. 2012), “ruvrand” and “ruvrandclust” (De Livera et al. 2015). Note that ruv2 is an application-specific method designed for identifying biomarkers using a linear model that adjusts for the unwanted variation component.
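The simplest of these approaches, normalization by a single internal standard, can be sketched as below. This is an illustration under stated assumptions rather than the package's code: the data are assumed to be log transformed already, so that taking a ratio to the standard becomes a subtraction, and the function name and column index are hypothetical.

```python
def normalize_by_internal_standard(log_matrix, standard_index):
    """Subtract each sample's log internal-standard abundance from the
    log abundances of its other metabolites.

    log_matrix: rows = samples, columns = metabolites (log scale).
    standard_index: column holding the internal standard (assumed known).
    The standard's own column is dropped from the result.
    """
    normalized = []
    for row in log_matrix:
        standard = row[standard_index]
        normalized.append(
            [value - standard
             for j, value in enumerate(row) if j != standard_index]
        )
    return normalized


# Hypothetical log abundances; column 1 is the internal standard.
log_data = [[5.0, 2.0, 7.0],
            [6.0, 3.0, 9.0]]
print(normalize_by_internal_standard(log_data, standard_index=1))
```

The multiple-standard and RUV methods involve model fitting and are not reduced to a short sketch here.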
This function is based on the quality control sample based robust LOESS (locally estimated scatterplot smoothing) signal correction (QC-RLSC) method described by Dunn et al. (2011) and implemented in statTarget (Luan 2017). Note that this approach log transforms the data after normalization.
The scaling normalization methods (Scholz et al. 2004; Wang et al. 2003) included in the package are normalization to a total sum and normalization by the median or mean of each sample, denoted by sum, median, and mean respectively. The method ref normalises the metabolite abundances to a specific reference vector such as the sample weight or volume.
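The four scaling options can be illustrated with a short sketch. It assumes untransformed abundances, so that scaling means dividing each sample by a factor; whether the package applies these on the raw or the log scale is not stated here. The function name and signature are hypothetical.

```python
import statistics


def scale_samples(matrix, method="median", ref=None):
    """Divide every sample (row) by a per-sample scaling factor.

    method: "sum", "median", or "mean" of the sample's abundances, or
            "ref" to use a user-supplied reference vector (one value per
            sample, e.g. sample weight or volume).
    """
    if method == "sum":
        factors = [sum(row) for row in matrix]
    elif method == "median":
        factors = [statistics.median(row) for row in matrix]
    elif method == "mean":
        factors = [statistics.fmean(row) for row in matrix]
    elif method == "ref":
        if ref is None or len(ref) != len(matrix):
            raise ValueError("ref must supply one value per sample")
        factors = list(ref)
    else:
        raise ValueError(f"unknown method: {method}")
    return [[x / f for x in row] for row, f in zip(matrix, factors)]


# Hypothetical abundances: 2 samples x 3 metabolites.
data = [[1.0, 2.0, 3.0],
        [2.0, 4.0, 12.0]]
print(scale_samples(data, "median"))
print(scale_samples(data, "ref", ref=[2.0, 4.0]))
```

After median scaling, every sample has a median abundance of one, which removes overall differences in sample concentration but not metabolite-specific unwanted variation.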
In some circumstances, researchers use a combination of the above normalizations (i.e., one method followed by another).
Here the following normalization methods were performed.
## Normalised using nomis ....Done
##
## Normalised using ruvrand ....Done
##
## Normalised using median ....Done
Volcano plots can be used to assess the impact of normalization on positive and negative control metabolites. See De Livera et al. (2012), De Livera, Olshansky, and Speed (2013), and De Livera et al. (2015) for details.
Histograms can be used to compare the distributions of the p-values obtained from the fitted linear model. If there are no differentially abundant metabolites in the data set, the p-values should be uniformly distributed between zero and one. Hence, when some differentially abundant metabolites are present, the histogram of p-values should look approximately uniform but with a peak close to zero (De Livera et al. 2015).
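The shape being described can be checked numerically by binning the p-values, as in the sketch below. The function name and the p-values are made up for illustration; in practice the p-values would come from the fitted linear model.

```python
def pvalue_histogram(pvalues, bins=10):
    """Count p-values in equal-width bins over [0, 1].

    A p-value of exactly 1.0 is placed in the last bin.
    """
    counts = [0] * bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        index = min(int(p * bins), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical p-values: a few small ones plus a roughly even spread.
pvals = [0.001, 0.01, 0.04, 0.15, 0.35, 0.55, 0.75, 0.95]
print(pvalue_histogram(pvals))
```

An excess of counts in the first bin relative to the others corresponds to the peak close to zero; a first bin no taller than the rest would suggest few or no differentially abundant metabolites.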
Using RLA plots, the residuals obtained from the fitted linear model can be explored. These boxplots should have a median close to zero and low variation around the median.
Using Venn diagrams, the consistency between results from different platforms can be assessed. In what follows, we simply compare the results from different normalisation methods.
Use RLA plots.
Use PCA plots.
Use RLA plots to explore the component of unwanted variation removed by normalization.
Use the RLA and PCA plots above.
Use the RLA and PCA plots above.
De Livera, Alysha M, M. Sysi-Aho, Laurent Jacob, J. Gagnon-Bartsch, Sandra Castillo, J.A. Simpson, and Terence P. Speed. 2015. “Statistical methods for handling unwanted variation in metabolomics data.” Analytical Chemistry 87 (7). American Chemical Society: 3606–15. doi:10.1021/ac502439y.
De Livera, Alysha M, Daniel A Dias, David De Souza, Thusitha Rupasinghe, James Pyke, Dedreia Tull, Ute Roessner, Malcolm McConville, and Terence P Speed. 2012. “Normalizing and integrating metabolomics data.” Analytical Chemistry 84 (24): 10768–76. doi:10.1021/ac302748b.
De Livera, Alysha M, Moshe Olshansky, and Terence P Speed. 2013. “Statistical analysis of metabolomics data.” Methods in Molecular Biology (Clifton, N.J.) 1055 (January): 291–307. doi:10.1007/978-1-62703-577-4_20.
Dunn, Warwick B, David Broadhurst, Paul Begley, Eva Zelena, Sue Francis-McIntyre, Nadine Anderson, Marie Brown, et al. 2011. “Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry.” Nature Protocols 6 (7): 1060–83.
Gagnon-Bartsch, Johann A, Laurent Jacob, and Terence P. Speed. 2014. Removing unwanted variation from high dimensional data with negative controls. IMS Monographs. Accepted for publication.
Gullberg, Jonas, Pär Jonsson, Anders Nordström, Michael Sjöström, and Thomas Moritz. 2004. “Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry.” Analytical Biochemistry 331 (2): 283–95. doi:10.1016/j.ab.2004.04.037.
Kirwan, JA, and DI Broadhurst. 2013. “Characterising and correcting batch variation in an automated direct infusion mass spectrometry (DIMS) metabolomics workflow.” Analytical and …, 5147–57. doi:10.1007/s00216-013-6856-7.
Luan, Hemi. 2017. “statTarget: R package.”
Redestig, Henning, Atsushi Fukushima, Hans Stenlund, Thomas Moritz, Masanori Arita, Kazuki Saito, and Miyako Kusano. 2009. “Compensation for systematic cross-contribution improves normalization of mass spectrometry based metabolomics data.” Analytical Chemistry 81 (19): 7974–80.
Scholz, M, S Gatzek, A Sterling, O Fiehn, and J Selbig. 2004. “Metabolite fingerprinting: detecting biological features by independent component analysis.” Bioinformatics (Oxford, England) 20 (15): 2447–54. doi:10.1093/bioinformatics/bth270.
Sysi-Aho, Marko, Mikko Katajamaa, Laxman Yetukuri, and Matej Oresic. 2007. “Normalization method for metabolomics data using optimal selection of multiple internal standards.” BMC Bioinformatics 8 (January): 93. doi:10.1186/1471-2105-8-93.
Troyanskaya, Olga, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. 2001. “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics 17 (6). Oxford University Press: 520–25.
Wang, Weixun, Haihong Zhou, Hua Lin, Sushmita Roy, Thomas A Shaler, Lander R Hill, Scott Norton, Praveen Kumar, Markus Anderle, and Christopher H Becker. 2003. “Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards.” Analytical Chemistry 75 (18): 4818–26.